Inspired by
TED talk of Hans Rosling: Let my dataset change your mindset https://www.ted.com/talks/hans_rosling_at_state#t-1176407
Hadley Wickham Managing many models with R https://www.youtube.com/watch?v=rz3_FDVt9eg
R-squared is a statistical measure of how close the data are to the fitted regression line
# VHS course "R"Kenntnisse um Wissen aus Daten zu gewinnen 2017
library(gapminder)
library(tidyverse)
library(plotly)
Attaching package: ‘plotly’
The following object is masked from ‘package:ggplot2’:
last_plot
The following object is masked from ‘package:stats’:
filter
The following object is masked from ‘package:graphics’:
layout
tidyverse_packages()
[1] "broom" "dplyr" "forcats" "ggplot2" "haven" "httr"
[7] "hms" "jsonlite" "lubridate" "magrittr" "modelr" "purrr"
[13] "readr" "readxl" "stringr" "tibble" "rvest" "tidyr"
[19] "xml2" "tidyverse"
Lets see what is in the data
summary(gapminder)
country continent year lifeExp pop
Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60 Min. :6.001e+04
Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20 1st Qu.:2.794e+06
Algeria : 12 Asia :396 Median :1980 Median :60.71 Median :7.024e+06
Angola : 12 Europe :360 Mean :1980 Mean :59.47 Mean :2.960e+07
Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85 3rd Qu.:1.959e+07
Australia : 12 Max. :2007 Max. :82.60 Max. :1.319e+09
(Other) :1632
gdpPercap
Min. : 241.2
1st Qu.: 1202.1
Median : 3531.8
Mean : 7215.3
3rd Qu.: 9325.5
Max. :113523.1
gapminder %>% filter(gdpPercap < 300)
# Print xy plot -----------------------------------------------------------
gapminder <- gapminder %>% mutate(year1950 = year -1950)
ggplot(gapminder, aes(x=year, y=lifeExp)) +geom_line(aes(group = country))
lets transform the data so that we keep all of a country’s data in one row
# Nested data -------------------------------------------------------------
by_country <- gapminder %>%
group_by(continent, country) %>%
nest()
by_country # show content of df
str(by_country[1:3,]) # show structure of first 3 rows of grouped and nested df
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3 obs. of 3 variables:
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 4 1
$ country : Factor w/ 142 levels "Afghanistan",..: 1 2 3
$ data :List of 3
..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 12 obs. of 5 variables:
.. ..$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
.. ..$ lifeExp : num 28.8 30.3 32 34 36.1 ...
.. ..$ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
.. ..$ gdpPercap: num 779 821 853 836 740 ...
.. ..$ year1950 : num 2 7 12 17 22 27 32 37 42 47 ...
..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 12 obs. of 5 variables:
.. ..$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
.. ..$ lifeExp : num 55.2 59.3 64.8 66.2 67.7 ...
.. ..$ pop : int 1282697 1476505 1728137 1984060 2263554 2509048 2780097 3075321 3326498 3428038 ...
.. ..$ gdpPercap: num 1601 1942 2313 2760 3313 ...
.. ..$ year1950 : num 2 7 12 17 22 27 32 37 42 47 ...
..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 12 obs. of 5 variables:
.. ..$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
.. ..$ lifeExp : num 43.1 45.7 48.3 51.4 54.5 ...
.. ..$ pop : int 9279525 10270856 11000948 12760499 14760787 17152804 20033753 23254956 26298373 29072015 ...
.. ..$ gdpPercap: num 2449 3014 2551 3247 4183 ...
.. ..$ year1950 : num 2 7 12 17 22 27 32 37 42 47 ...
by_country$data[[1]] # show data of first row
by_country$data[1] # show data of first row
[[1]]
by_country$data[[1]][[2]] # show contents of second column
[1] 28.801 30.332 31.997 34.020 36.088 38.438 39.854 40.822 41.674 41.763 42.129
[12] 43.828
by_country$data[[1]][2] # show second column
a good explanation can be found at: http://r4ds.had.co.nz/vectors.html#lists
The column data is a list of data frames. Therefore, we have now a row per country and all data of that country in a dataframe in one column.
In a grouped dataframe each row is an oberservation, in a nested dataframe each row is a group, in this case, a group of a country’s observations.
# Fit models --------------------------------------------------------------
country_model <- function(df){
lm(lifeExp ~ year1950, data=df) # use year1950 because the absolute value is not important
}
models <- by_country %>%
mutate(
model = map(data, country_model)
)
models
lm(lifeExp ~ year1950, data=by_country$data[[1]])
Call:
lm(formula = lifeExp ~ year1950, data = by_country$data[[1]])
Coefficients:
(Intercept) year1950
29.3566 0.2753
models$model[1]
[[1]]
Call:
lm(formula = lifeExp ~ year1950, data = df)
Coefficients:
(Intercept) year1950
29.3566 0.2753
# Put it all together -----------------------------------------------------
by_country <- gapminder %>%
group_by(continent, country) %>%
nest() %>%
mutate(
model = map(data, country_model)
)
by_country
models <- models %>%
mutate(
glance = map(model, broom::glance), # Construct a single row summary "glance" of a model
rsq = glance %>% map_dbl("r.squared"), # note the pipe within mutate(...)
tidy = map(model, broom::tidy), # Tidy the result of a test into a summary data.frame
augment = map(model, broom::augment)
)
models
models$glance[1]
[[1]]
models$rsq[1]
[1] 0.9477123
models$tidy[1]
[[1]]
models$augment[1]
[[1]]
NA
models %>%
ggplot(aes(rsq, country)) +
geom_point(aes(colour = continent))
# source("gapminderShiny.R")
is the plot clear? how could it be improved?
ggplot orders categorical variables alphabetically
models %>%
ggplot(aes(rsq, reorder(country, rsq))) +
geom_point(aes(colour = continent))
models %>% filter((rsq<0.1 & rsq>0)) %>% unnest(rsq) %>% top_n(6,rsq) %>% unnest(data) %>%
ggplot(aes(year, lifeExp)) +
geom_line(aes( alpha = 1/3)) +
facet_wrap(~country)
NA
# Unnest data -------------------------------------------------------------
unnest(models, data)
unnest(models, glance, .drop = TRUE)
unnest(models, tidy)
# Plot data frame ---------------------------------------------------------
plotLife <- models %>%
unnest(tidy) %>%
select(continent, country, term, estimate, rsq) %>%
spread(term, estimate) %>%
ggplot(aes(`(Intercept)`,year1950))+
geom_point(aes(colour = continent, size = rsq, fill = country)) +
geom_smooth(se=FALSE) +
xlab("Life expectancy (1950)") +
ylab("Yearly improvement") +
scale_size_area() + guides(fill=FALSE)
ggplotly(plotLife, tooltip = c("year1950", "country"))
`geom_smooth()` using method = 'loess'
Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).